fix: report correct reason in kube_pod_status_reason metric #2644

Open: carlosmorenokm1 wants to merge 2 commits into main from fix-pod-status-reason

Conversation

carlosmorenokm1

What this PR does / why we need it:
This PR updates the logic for generating the kube_pod_status_reason metric. Instead of only checking p.Status.Reason, the new implementation also verifies the pod conditions and the termination reasons of container statuses. This change fixes an issue where the metric always returned 0, even when a pod had a valid status reason (such as "Evicted", "NodeLost", etc.), leading to inaccurate monitoring data. Accurately reporting these values is crucial for diagnosing pod behavior and overall cluster health.

How does this change affect the cardinality of KSM:
It does not change the cardinality. The update only adjusts the value calculation for an existing metric family, so no new labels or metric series are introduced.

Which issue(s) this PR fixes:
Fixes #2612
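
The matching logic described above can be sketched as follows. This is a minimal illustration, not the PR's actual diff: the struct types are trimmed stand-ins for the `k8s.io/api/core/v1` types so the example runs without dependencies, and `reasonValue` is a hypothetical helper name.

```go
package main

import "fmt"

// Trimmed stand-ins for the k8s.io/api/core/v1 types (assumption: only the
// fields this sketch needs are modeled).
type PodCondition struct{ Reason string }
type ContainerStateTerminated struct{ Reason string }
type ContainerState struct{ Terminated *ContainerStateTerminated }
type ContainerStatus struct{ State ContainerState }
type PodStatus struct {
	Reason            string
	Conditions        []PodCondition
	ContainerStatuses []ContainerStatus
}

// reasonValue (hypothetical name) returns 1 if the given reason appears in
// the pod status reason, any pod condition, or any terminated container
// state, and 0 otherwise.
func reasonValue(s PodStatus, reason string) float64 {
	if s.Reason == reason {
		return 1
	}
	for _, cond := range s.Conditions {
		if cond.Reason == reason {
			return 1
		}
	}
	for _, cs := range s.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.Reason == reason {
			return 1
		}
	}
	return 0
}

func main() {
	evicted := PodStatus{Reason: "Evicted"}
	// prints: 1 0
	fmt.Println(reasonValue(evicted, "Evicted"), reasonValue(evicted, "NodeLost"))
}
```

Because the function only changes the value returned for each existing reason label, a sketch like this is consistent with the cardinality statement above: the set of series stays the same.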

linux-foundation-easycla bot commented Apr 1, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: carlosmorenokm1
Once this PR has been reviewed and has the lgtm label, please assign rexagod for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 1, 2025
@k8s-ci-robot
Contributor

Welcome @carlosmorenokm1!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 1, 2025
@carlosmorenokm1 carlosmorenokm1 force-pushed the fix-pod-status-reason branch from 23f9138 to 518db3d Compare April 1, 2025 04:01
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 1, 2025
Comment on lines +1563 to +1567
for _, cond := range p.Status.Conditions {
	if cond.Reason == reason {
		return 1
	}
}
Contributor

Should we only care about the last condition? If so, do we need to remove this part?

Author

No, it's necessary to iterate through all the conditions because the reason may be in any of them.

Contributor

Will it be a stale condition?

It will not be a stale condition. Kubernetes regularly updates Pod conditions, so if a condition with the corresponding reason is found, it is assumed to be current. If a stale condition were detected, that would indicate an issue in Kubernetes, not in this logic.

Contributor

Will a pod have multiple different reasons?

Yes, a Pod can have different “Reasons” throughout its lifecycle. Each event or change in the Pod’s state (for example, container creation, image pulling, runtime errors, restarts, etc.) can trigger a different reason. In Kubernetes, these “Reasons” are recorded at different points in the Pod’s lifecycle, so it is entirely possible for a single Pod to go through multiple different “Reasons” as it transitions between states.

Contributor

Yes, I was thinking of the case where the pod first fails to pull an image, then hits runtime errors, then restarts.

Will the above metric report all three of these statuses?

Author

Yes, if your Pod transitions through those states (e.g., failed to pull image, runtime errors, then restarts), the metric can capture each corresponding reason at the time it occurs. However, you won’t necessarily see all reasons simultaneously; rather, you’ll see them reflected as changes in the metric over the Pod’s lifecycle.
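
The difference between the two approaches discussed in this thread (checking every condition versus only the last one) can be sketched as below. This is an illustration under stated assumptions, not code from the PR: `PodCondition` is a trimmed stand-in for the Kubernetes type, both helper names are hypothetical, and the condition values and their order in the slice are made up for the example.

```go
package main

import "fmt"

// PodCondition is a trimmed stand-in for the Kubernetes PodCondition type
// (assumption: only Type and Reason are modeled).
type PodCondition struct {
	Type   string
	Reason string
}

// anyConditionHasReason (hypothetical name) reports whether any condition
// carries the reason, mirroring the loop under review.
func anyConditionHasReason(conds []PodCondition, reason string) bool {
	for _, c := range conds {
		if c.Reason == reason {
			return true
		}
	}
	return false
}

// lastConditionHasReason (hypothetical name) models the reviewer's
// alternative: only the final entry in the slice is considered.
func lastConditionHasReason(conds []PodCondition, reason string) bool {
	if len(conds) == 0 {
		return false
	}
	return conds[len(conds)-1].Reason == reason
}

func main() {
	// Illustrative values only; the slice order is an assumption.
	conds := []PodCondition{
		{Type: "DisruptionTarget", Reason: "TerminationByKubelet"},
		{Type: "Ready", Reason: "PodFailed"},
	}
	fmt.Println(anyConditionHasReason(conds, "TerminationByKubelet"))  // true
	fmt.Println(lastConditionHasReason(conds, "TerminationByKubelet")) // false
}
```

Since each entry describes a different condition type, a last-entry check depends on slice order, which is one way to read the author's argument for iterating over all conditions.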

@CatherineF-dev
Contributor

Could we update the title to "fix: report correct reason in kube_pod_status_reason metric"?

@carlosmorenokm1 changed the title from "fix: report correct values in kube_pod_status_reason metric" to "fix: report correct reason in kube_pod_status_reason metric" Apr 1, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 1, 2025
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

kube_pod_status_reason is 0 for all reasons
6 participants